Preparing data
Type conversion
Types of variables in R
As in other programming languages, R is capable of storing data in many different formats, most of which you’ve probably seen by now.
Loosely speaking, the class() function tells you what type of object you’re working with. (There are subtle differences between the class, type, and mode of an object, but these distinctions are beyond the scope of this course.)
# Make this evaluate to "character"
class("TRUE")
## [1] "character"
# Make this evaluate to "numeric"
class(8484.00)
## [1] "numeric"
# Make this evaluate to "integer"
class(99L)
## [1] "integer"
# Make this evaluate to "factor"
class(as.factor("factor"))
## [1] "factor"
# Make this evaluate to "logical"
class(FALSE)
## [1] "logical"
Common type conversions
It is often necessary to change, or coerce, the way that variables in a dataset are stored. This could be because of the way they were read into R (with read.csv(), for example) or because the function you are using to analyze the data requires variables to be coded a certain way.
Only certain coercions are allowed, but the rules for what works are generally pretty intuitive. For example, trying to convert a character string that does not look like a number, as in as.numeric("some text"), returns NA with a warning rather than a number.
There are a few less intuitive results. For example, under the hood, the logical values TRUE and FALSE are coded as 1 and 0, respectively. Therefore, as.logical(1) returns TRUE and as.numeric(TRUE) returns 1.
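The coercion rules above can be checked directly at the console. The following is a minimal sketch in base R; the names not_a_number, medu and medu_factor are made up for illustration. It also shows one pitfall worth knowing: calling as.numeric() on a factor returns the internal level codes rather than the original values, which is why the str() output later in this section lists codes 5 2 2 5 for Medu values 4 1 1 4.

```r
# A string that does not look like a number becomes NA (with a warning)
not_a_number <- suppressWarnings(as.numeric("some text"))
is.na(not_a_number)

# Logical values are stored as 1 and 0 under the hood
as.logical(1)
as.numeric(TRUE)
sum(c(TRUE, FALSE, TRUE))  # counts the TRUE values

# Pitfall: as.numeric() on a factor returns level codes, not the labels
medu <- c(4, 1, 1, 4)                             # hypothetical values
medu_factor <- factor(medu, levels = 0:4)         # levels "0" to "4"
codes <- as.numeric(medu_factor)                  # level codes
values <- as.numeric(as.character(medu_factor))   # original values
```

Going through as.character() first is the usual way to recover the original numbers from a factor.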
# Read students data
library(readr)
students <- read_csv("../xDatasets/students_with_dates.csv")
## Warning: Missing column names filled in: 'X1' [1]
# Preview students with str()
str(students, give.attr = FALSE)
## Classes 'spec_tbl_df', 'tbl_df', 'tbl' and 'data.frame': 395 obs. of 33 variables:
## $ X1 : num 1 2 3 4 5 6 7 8 9 10 ...
## $ school : chr "GP" "GP" "GP" "GP" ...
## $ sex : chr "F" "F" "F" "F" ...
## $ dob : Date, format: "2000-06-05" "1999-11-25" ...
## $ address : chr "U" "U" "U" "U" ...
## $ famsize : chr "GT3" "GT3" "LE3" "GT3" ...
## $ Pstatus : chr "A" "T" "T" "T" ...
## $ Medu : num 4 1 1 4 3 4 2 4 3 3 ...
## $ Fedu : num 4 1 1 2 3 3 2 4 2 4 ...
## $ Mjob : chr "at_home" "at_home" "at_home" "health" ...
## $ Fjob : chr "teacher" "other" "other" "services" ...
## $ reason : chr "course" "course" "other" "home" ...
## $ guardian : chr "mother" "father" "mother" "mother" ...
## $ traveltime : num 2 1 1 1 1 1 1 2 1 1 ...
## $ studytime : num 2 2 2 3 2 2 2 2 2 2 ...
## $ failures : num 0 0 3 0 0 0 0 0 0 0 ...
## $ schoolsup : chr "yes" "no" "yes" "no" ...
## $ famsup : chr "no" "yes" "no" "yes" ...
## $ paid : chr "no" "no" "yes" "yes" ...
## $ activities : chr "no" "no" "no" "yes" ...
## $ nursery : chr "yes" "no" "yes" "yes" ...
## $ higher : chr "yes" "yes" "yes" "yes" ...
## $ internet : chr "no" "yes" "yes" "yes" ...
## $ romantic : chr "no" "no" "no" "yes" ...
## $ famrel : num 4 5 4 3 4 5 4 4 4 5 ...
## $ freetime : num 3 3 3 2 3 4 4 1 2 5 ...
## $ goout : num 4 3 2 2 2 2 4 4 2 1 ...
## $ Dalc : num 1 1 2 1 1 1 1 1 1 1 ...
## $ Walc : num 1 1 3 1 2 2 1 1 1 1 ...
## $ health : num 3 3 3 5 5 5 3 1 1 5 ...
## $ nurse_visit: POSIXct, format: "2014-04-10 14:59:54" "2015-03-12 14:59:54" ...
## $ absences : num 6 4 10 2 4 10 0 6 0 0 ...
## $ Grades : chr "5/6/6" "5/5/6" "7/8/10" "15/14/15" ...
# Coerce Grades to character
students$Grades <- as.character(students$Grades)
# Coerce Medu to factor
students$Medu <- as.factor(students$Medu)
# Coerce Fedu to factor
students$Fedu <- as.factor(students$Fedu)
# Look at students once more with str()
str(students, give.attr = FALSE)
## Classes 'spec_tbl_df', 'tbl_df', 'tbl' and 'data.frame': 395 obs. of 33 variables:
## $ X1 : num 1 2 3 4 5 6 7 8 9 10 ...
## $ school : chr "GP" "GP" "GP" "GP" ...
## $ sex : chr "F" "F" "F" "F" ...
## $ dob : Date, format: "2000-06-05" "1999-11-25" ...
## $ address : chr "U" "U" "U" "U" ...
## $ famsize : chr "GT3" "GT3" "LE3" "GT3" ...
## $ Pstatus : chr "A" "T" "T" "T" ...
## $ Medu : Factor w/ 5 levels "0","1","2","3",..: 5 2 2 5 4 5 3 5 4 4 ...
## $ Fedu : Factor w/ 5 levels "0","1","2","3",..: 5 2 2 3 4 4 3 5 3 5 ...
## $ Mjob : chr "at_home" "at_home" "at_home" "health" ...
## $ Fjob : chr "teacher" "other" "other" "services" ...
## $ reason : chr "course" "course" "other" "home" ...
## $ guardian : chr "mother" "father" "mother" "mother" ...
## $ traveltime : num 2 1 1 1 1 1 1 2 1 1 ...
## $ studytime : num 2 2 2 3 2 2 2 2 2 2 ...
## $ failures : num 0 0 3 0 0 0 0 0 0 0 ...
## $ schoolsup : chr "yes" "no" "yes" "no" ...
## $ famsup : chr "no" "yes" "no" "yes" ...
## $ paid : chr "no" "no" "yes" "yes" ...
## $ activities : chr "no" "no" "no" "yes" ...
## $ nursery : chr "yes" "no" "yes" "yes" ...
## $ higher : chr "yes" "yes" "yes" "yes" ...
## $ internet : chr "no" "yes" "yes" "yes" ...
## $ romantic : chr "no" "no" "no" "yes" ...
## $ famrel : num 4 5 4 3 4 5 4 4 4 5 ...
## $ freetime : num 3 3 3 2 3 4 4 1 2 5 ...
## $ goout : num 4 3 2 2 2 2 4 4 2 1 ...
## $ Dalc : num 1 1 2 1 1 1 1 1 1 1 ...
## $ Walc : num 1 1 3 1 2 2 1 1 1 1 ...
## $ health : num 3 3 3 5 5 5 3 1 1 5 ...
## $ nurse_visit: POSIXct, format: "2014-04-10 14:59:54" "2015-03-12 14:59:54" ...
## $ absences : num 6 4 10 2 4 10 0 6 0 0 ...
## $ Grades : chr "5/6/6" "5/5/6" "7/8/10" "15/14/15" ...
Working with dates
Dates can be a challenge to work with in any programming language, but thanks to the lubridate package, working with dates in R isn’t so bad. Since this course is about cleaning data, we only cover the most basic functions from lubridate to help us standardize the format of dates and times in our data.
These functions combine the letters y, m, d, h, m, s, which stand for year, month, day, hour, minute, and second, respectively. The order of the letters in the function should match the order of the date/time you are attempting to read in, although not all combinations are valid. Notice that the functions are “smart” in that they are capable of parsing multiple formats.
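To make the naming scheme concrete, here is a small sketch, assuming the lubridate package is installed and an English locale (month names are locale dependent). The variable names d1, d2, d3, dt and bad are illustrative only. It parses the same calendar date written three ways and shows what happens when a string cannot be parsed.

```r
library(lubridate)

# The same calendar date written three different ways
d1 <- ymd("2015-08-25")
d2 <- ymd("2015 August 25")
d3 <- mdy("August 25, 2015")

# All three parse to the same Date value
d1 == d2
d2 == d3

# Date plus time; the result is a POSIXct, in UTC unless tz is given
dt <- ymd_hms("2015-08-25 13:33:09")
class(dt)

# A string that cannot be parsed yields NA (with a warning)
bad <- suppressWarnings(ymd("not a date"))
is.na(bad)
```

Because failed parses become NA instead of stopping with an error, it is worth checking for new NA values after converting a column.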
install.packages("lubridate")
# Read students data
library(readr)
students2 <- read_csv("../xDatasets/students_with_dates.csv")
## Warning: Missing column names filled in: 'X1' [1]
# Preview students2 with str()
#str(students2)
# Load the lubridate package
library(lubridate)
# Parse as date
dmy("17 Sep 2015")
## [1] "2015-09-17"
# Parse as date and time (with no seconds!)
mdy_hm("July 15, 2012 12:56")
## [1] "2012-07-15 12:56:00 UTC"
# Coerce dob to a date (with no time)
students2$dob <- ymd(students2$dob)
# Coerce nurse_visit to a date and time
students2$nurse_visit <- ymd_hms(students2$nurse_visit)
# Look at students2 once more with str()
str(students2, give.attr = FALSE, vec.len = 8)
## Classes 'spec_tbl_df', 'tbl_df', 'tbl' and 'data.frame': 395 obs. of 33 variables:
## $ X1 : num 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 ...
## $ school : chr "GP" "GP" "GP" "GP" "GP" "GP" "GP" "GP" ...
## $ sex : chr "F" "F" "F" "F" "F" "M" "M" "F" ...
## $ dob : Date, format: "2000-06-05" "1999-11-25" ...
## $ address : chr "U" "U" "U" "U" "U" "U" "U" "U" ...
## $ famsize : chr "GT3" "GT3" "LE3" "GT3" "GT3" "LE3" "LE3" "GT3" ...
## $ Pstatus : chr "A" "T" "T" "T" "T" "T" "T" "A" ...
## $ Medu : num 4 1 1 4 3 4 2 4 3 3 4 2 4 4 2 4 4 3 3 4 ...
## $ Fedu : num 4 1 1 2 3 3 2 4 2 4 4 1 4 3 2 4 4 3 2 3 ...
## $ Mjob : chr "at_home" "at_home" "at_home" "health" "other" "services" "other" "other" ...
## $ Fjob : chr "teacher" "other" "other" "services" "other" "other" "other" "teacher" ...
## $ reason : chr "course" "course" "other" "home" "home" "reputation" "home" "home" ...
## $ guardian : chr "mother" "father" "mother" "mother" "father" "mother" "mother" "mother" ...
## $ traveltime : num 2 1 1 1 1 1 1 2 1 1 1 3 1 2 1 1 1 3 1 1 ...
## $ studytime : num 2 2 2 3 2 2 2 2 2 2 2 3 1 2 3 1 3 2 1 1 ...
## $ failures : num 0 0 3 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 3 0 ...
## $ schoolsup : chr "yes" "no" "yes" "no" "no" "no" "no" "yes" ...
## $ famsup : chr "no" "yes" "no" "yes" "yes" "yes" "no" "yes" ...
## $ paid : chr "no" "no" "yes" "yes" "yes" "yes" "no" "no" ...
## $ activities : chr "no" "no" "no" "yes" "no" "yes" "no" "no" ...
## $ nursery : chr "yes" "no" "yes" "yes" "yes" "yes" "yes" "yes" ...
## $ higher : chr "yes" "yes" "yes" "yes" "yes" "yes" "yes" "yes" ...
## $ internet : chr "no" "yes" "yes" "yes" "no" "yes" "yes" "no" ...
## $ romantic : chr "no" "no" "no" "yes" "no" "no" "no" "no" ...
## $ famrel : num 4 5 4 3 4 5 4 4 4 5 3 5 4 5 4 4 3 5 5 3 ...
## $ freetime : num 3 3 3 2 3 4 4 1 2 5 3 2 3 4 5 4 2 3 5 1 ...
## $ goout : num 4 3 2 2 2 2 4 4 2 1 3 2 3 3 2 4 3 2 5 3 ...
## $ Dalc : num 1 1 2 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 1 ...
## $ Walc : num 1 1 3 1 2 2 1 1 1 1 2 1 3 2 1 2 2 1 4 3 ...
## $ health : num 3 3 3 5 5 5 3 1 1 5 2 4 5 3 3 2 2 4 5 5 ...
## $ nurse_visit: POSIXct, format: "2014-04-10 14:59:54" "2015-03-12 14:59:54" ...
## $ absences : num 6 4 10 2 4 10 0 6 0 0 0 4 2 2 0 4 6 4 16 4 ...
## $ Grades : chr "5/6/6" "5/5/6" "7/8/10" "15/14/15" "6/10/10" "15/15/15" "12/12/11" "6/5/6" ...
String manipulation
install.packages("stringr")
Trimming and padding strings
One common issue that comes up when cleaning data is the need to remove leading and/or trailing white space. The str_trim() function from stringr makes it easy to do this while leaving intact the part of the string that you actually want.
str_trim(" this is a test ")
[1] "this is a test"
A similar issue is when you need to pad strings to make them a certain number of characters wide. One example is if you had a bunch of employee ID numbers, some of which begin with one or more zeros. When reading these data in, you find that the leading zeros have been dropped somewhere along the way (probably because the variable was thought to be numeric and in that case, leading zeros would be unnecessary.)
str_pad("24493", width = 7, side = "left", pad = "0")
[1] "0024493"
# Load the stringr package
library(stringr)
# Trim all leading and trailing whitespace
str_trim(c(" Filip ", "Nick ", " Jonathan"))
## [1] "Filip" "Nick" "Jonathan"
# Pad these strings with leading zeros
str_pad(c("23485W", "8823453Q", "994Z"), width = 9, side = "left", pad = "0")
## [1] "00023485W" "08823453Q" "00000994Z"
Padding comes up often in practice. For example, str_pad() is useful when importing a dataset with US zip codes: if the column is read in as numeric, the leading 0 of a zip code is dropped.
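The zip code case can be sketched as follows, assuming stringr is installed. The zip codes and the variable names (zips_numeric, zips_padded, zips_base) are made up for illustration. The values are converted to character first and then padded back to five characters; a base R equivalent using sprintf() is shown for comparison.

```r
library(stringr)

# Hypothetical zip codes that were read in as numbers, losing the leading zeros
zips_numeric <- c(2138, 90210, 501)

# Convert to character first, then pad to five characters
zips_padded <- str_pad(as.character(zips_numeric), width = 5, side = "left", pad = "0")
zips_padded

# Base R alternative that gives the same result
zips_base <- sprintf("%05d", as.integer(zips_numeric))
identical(zips_padded, zips_base)
```

Where possible, it is simpler to read such columns as character in the first place so the zeros are never lost.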
Upper and lower case
In addition to trimming and padding strings, you may need to adjust their case from time to time. Making strings uppercase or lowercase is very straightforward in (base) R thanks to toupper() and tolower(). Each function takes exactly one argument: the character string (or vector/column of strings) to be converted to the desired case.
# state abbreviations
states <- c("al", "ak", "az", "ar", "ca", "co", "ct", "de", "fl", "ga", "hi", "id", "il", "in", "ia", "ks", "ky", "la", "me", "md", "ma", "mi", "mn", "ms", "mo", "mt", "ne", "nv", "nh", "nj", "nm", "ny", "nc", "nd", "oh", "ok", "or", "pa", "ri", "sc", "sd", "tn", "tx", "ut", "vt", "va", "wa", "wv", "wi", "wy")
# Make states all uppercase and save result to states_upper
states_upper <- toupper(states)
# Make states_upper all lowercase again
tolower(states_upper)
## [1] "al" "ak" "az" "ar" "ca" "co" "ct" "de" "fl" "ga" "hi" "id" "il" "in"
## [15] "ia" "ks" "ky" "la" "me" "md" "ma" "mi" "mn" "ms" "mo" "mt" "ne" "nv"
## [29] "nh" "nj" "nm" "ny" "nc" "nd" "oh" "ok" "or" "pa" "ri" "sc" "sd" "tn"
## [43] "tx" "ut" "vt" "va" "wa" "wv" "wi" "wy"
Finding and replacing strings
The stringr package provides two functions that are very useful for finding and/or replacing patterns in strings: str_detect() and str_replace().
Like all functions in stringr, the first argument of each is the string of interest. The second argument of each is the pattern of interest. In the case of str_detect(), this is the pattern we are searching for. In the case of str_replace(), this is the pattern we want to replace. Finally, str_replace() has a third argument, which is the string to replace with.
str_detect(c("banana", "kiwi"), "a")
[1] TRUE FALSE
str_replace(c("banana", "kiwi"), "a", "o")
[1] "bonana" "kiwi"
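Note that only the first "a" in "banana" was replaced. A short sketch, assuming stringr is installed, contrasts str_replace() with str_replace_all() and shows that patterns are regular expressions, so they can be anchored to match a whole value. The sex vector here is a hypothetical stand-in for a column like the one in the students data.

```r
library(stringr)

fruit <- c("banana", "kiwi")

# str_replace() changes only the first match in each string
first_only <- str_replace(fruit, "a", "o")

# str_replace_all() changes every match
all_matches <- str_replace_all(fruit, "a", "o")

# Patterns are regular expressions, so ^ and $ can anchor a whole-string match
sex <- c("F", "M", "F")   # hypothetical column values
sex_full <- str_replace(sex, "^F$", "Female")
sex_full <- str_replace(sex_full, "^M$", "Male")
```

Anchoring the pattern guards against accidentally rewriting a letter inside a longer value.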
The data.frame students2 is already available for you in the workspace. stringr is already loaded. students3 is a copy of it for you to work on so you can always start from scratch if you happen to make a mistake.
# Copy of students2: students3
students3 <- students2
# Look at the head of students3
# %>% comes from magrittr, kable() from knitr, kable_styling() and row_spec() from kableExtra
library(magrittr)
library(knitr)
library(kableExtra)
students3 %>%
head() %>%
kable() %>%
kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive"), full_width = FALSE, position = "left", font_size = 11) %>%
row_spec(0, bold = TRUE, color = "white", background = "#3f7689")
| X1 | school | sex | dob | address | famsize | Pstatus | Medu | Fedu | Mjob | Fjob | reason | guardian | traveltime | studytime | failures | schoolsup | famsup | paid | activities | nursery | higher | internet | romantic | famrel | freetime | goout | Dalc | Walc | health | nurse_visit | absences | Grades |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | GP | F | 2000-06-05 | U | GT3 | A | 4 | 4 | at_home | teacher | course | mother | 2 | 2 | 0 | yes | no | no | no | yes | yes | no | no | 4 | 3 | 4 | 1 | 1 | 3 | 2014-04-10 14:59:54 | 6 | 5/6/6 |
| 2 | GP | F | 1999-11-25 | U | GT3 | T | 1 | 1 | at_home | other | course | father | 1 | 2 | 0 | no | yes | no | no | no | yes | yes | no | 5 | 3 | 3 | 1 | 1 | 3 | 2015-03-12 14:59:54 | 4 | 5/5/6 |
| 3 | GP | F | 1998-02-02 | U | LE3 | T | 1 | 1 | at_home | other | other | mother | 1 | 2 | 3 | yes | no | yes | no | yes | yes | yes | no | 4 | 3 | 2 | 2 | 3 | 3 | 2015-09-21 14:59:54 | 10 | 7/8/10 |
| 4 | GP | F | 1997-12-20 | U | GT3 | T | 4 | 2 | health | services | home | mother | 1 | 3 | 0 | no | yes | yes | yes | yes | yes | yes | yes | 3 | 2 | 2 | 1 | 1 | 5 | 2015-09-03 14:59:54 | 2 | 15/14/15 |
| 5 | GP | F | 1998-10-04 | U | GT3 | T | 3 | 3 | other | other | home | father | 1 | 2 | 0 | no | yes | yes | no | yes | yes | no | no | 4 | 3 | 2 | 1 | 2 | 5 | 2015-04-07 14:59:54 | 4 | 6/10/10 |
| 6 | GP | M | 1999-06-16 | U | LE3 | T | 4 | 3 | services | other | reputation | mother | 1 | 2 | 0 | no | yes | yes | yes | yes | yes | yes | no | 5 | 4 | 2 | 1 | 2 | 5 | 2013-11-15 14:59:54 | 10 | 15/15/15 |
# Detect all dates of birth (dob) in 1997, print 10 first results
str_detect(students3$dob, "1997")[1:10]
## [1] FALSE FALSE FALSE TRUE FALSE FALSE TRUE FALSE FALSE TRUE
# In the sex column, replace "F" with "Female" ...
students3$sex <- str_replace(students3$sex, "F", "Female")
# ... and "M" with "Male"
students3$sex <- str_replace(students3$sex, "M", "Male")
# View the tail of students3
students3 %>%
tail(8) %>%
kable() %>%
kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive"), full_width = FALSE, position = "left", font_size = 11) %>%
row_spec(0, bold = TRUE, color = "white", background = "#3f7689")
| X1 | school | sex | dob | address | famsize | Pstatus | Medu | Fedu | Mjob | Fjob | reason | guardian | traveltime | studytime | failures | schoolsup | famsup | paid | activities | nursery | higher | internet | romantic | famrel | freetime | goout | Dalc | Walc | health | nurse_visit | absences | Grades |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 388 | MS | Female | 1999-05-10 | R | GT3 | T | 2 | 3 | services | other | course | mother | 1 | 3 | 1 | no | no | no | yes | no | yes | yes | no | 5 | 4 | 2 | 1 | 2 | 5 | 2014-12-30 14:59:54 | 0 | 7/5/0 |
| 389 | MS | Female | 1999-11-19 | U | LE3 | T | 3 | 1 | teacher | services | course | mother | 1 | 2 | 0 | no | yes | yes | no | yes | yes | yes | no | 4 | 3 | 4 | 1 | 1 | 1 | 2014-08-11 14:59:54 | 0 | 7/9/8 |
| 390 | MS | Female | 1999-07-04 | U | GT3 | T | 1 | 1 | other | other | course | mother | 2 | 2 | 1 | no | no | no | yes | yes | yes | no | no | 1 | 1 | 1 | 1 | 1 | 5 | 2013-12-09 14:59:54 | 0 | 6/5/0 |
| 391 | MS | Male | 1998-06-06 | U | LE3 | A | 2 | 2 | services | services | course | other | 1 | 2 | 2 | no | yes | yes | no | yes | yes | no | no | 5 | 5 | 4 | 4 | 5 | 4 | 2015-08-06 14:59:54 | 11 | 9/9/9 |
| 392 | MS | Male | 2000-04-04 | U | LE3 | T | 3 | 1 | services | services | course | mother | 2 | 1 | 0 | no | no | no | no | no | yes | yes | no | 2 | 4 | 5 | 3 | 4 | 2 | 2014-09-01 14:59:54 | 3 | 14/16/16 |
| 393 | MS | Male | 2000-02-07 | R | GT3 | T | 1 | 1 | other | other | course | other | 1 | 1 | 3 | no | no | no | no | no | yes | no | no | 5 | 5 | 3 | 3 | 3 | 3 | 2015-03-15 14:59:54 | 3 | 10/8/7 |
| 394 | MS | Male | 1999-09-05 | R | LE3 | T | 3 | 2 | services | other | course | mother | 3 | 1 | 0 | no | no | no | no | no | yes | yes | no | 4 | 4 | 1 | 3 | 4 | 5 | 2015-06-12 14:59:54 | 0 | 11/12/10 |
| 395 | MS | Male | 1999-01-27 | U | LE3 | T | 1 | 1 | other | at_home | course | father | 1 | 1 | 0 | no | no | no | no | yes | yes | yes | no | 3 | 2 | 3 | 3 | 3 | 5 | 2015-05-31 14:59:54 | 5 | 8/9/9 |
Missing and special values
Finding missing values
As you’ve seen, missing values in R should be represented by NA, but unfortunately you will not always be so lucky. Before you can deal with missing values, you have to find them in the data.
If missing values are properly coded as NA, the is.na() function will help you find them. Otherwise, if your dataset is too big to just look at the whole thing, you may need to try searching for some of the usual suspects like "", "#N/A", etc. You can also use the summary() and table() functions to turn up unexpected values in your data.
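Searching for the usual suspects might look like the sketch below, which uses only base R. The survey data frame and the suspects list are assumptions made up for illustration; adapt the list to whatever codes appear in your own data.

```r
# Hypothetical survey data with missing values coded three different ways
survey <- data.frame(
  id = 1:5,
  income = c("52000", "", "#N/A", "61000", NA),
  stringsAsFactors = FALSE
)

# is.na() only finds the value that is properly coded as NA
n_na_before <- sum(is.na(survey$income))

# Search for the usual suspects explicitly
suspects <- c("", "#N/A", "N/A", "NA")
disguised <- survey$income %in% suspects
which(disguised)

# Recode them so that is.na() finds everything
survey$income[disguised] <- NA
n_missing <- sum(is.na(survey$income))
```

After recoding, a single call to is.na() covers every missing value regardless of how it was originally entered.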
In this exercise, we’ve created a simple dataset called social_df that has 3 pieces of information for each of four friends:
Name
Number of friends on a popular social media platform
Current “status” on the platform
# Create small Social data frame
name <- c("Sarah", "Tom", "David", "Alice")
n_friends <- c(244, NA, 145, 43)
status <- c("Going out!", "", "Movie night...", "")
social_df <- data.frame(name, n_friends, status)
# Call is.na() on the full social_df to spot all NAs
is.na(social_df)
##       name n_friends status
## [1,] FALSE     FALSE  FALSE
## [2,] FALSE      TRUE  FALSE
## [3,] FALSE     FALSE  FALSE
## [4,] FALSE     FALSE  FALSE
# Use the any() function to ask whether there are any NAs in the data
any(is.na(social_df))
## [1] TRUE
# View a summary() of the dataset
summary(social_df)
##     name     n_friends               status
##  Alice:1   Min.   : 43.0                 :2
##  David:1   1st Qu.: 94.0   Going out!    :1
##  Sarah:1   Median :145.0   Movie night...:1
##  Tom  :1   Mean   :144.0
##            3rd Qu.:194.5
##            Max.   :244.0
##            NA's   :1
# Call table() on the status column
table(social_df$status)
##
##                    Going out! Movie night...
##              2              1              1
Scanning your dataset for NA values, and for values that stand in for NA, is an essential first step before you can remedy missing data problems.
Dealing with missing values
Missing values can be a rather complex subject, but here we’ll only look at the simple case where you want to normalize and/or remove all missing values from your data. For more information on why this is not always the best strategy, search online for “missing not at random.”
Looking at the social_df dataset again, we asked around a bit and figured out what’s causing the missing values that you saw in the last exercise. Tom doesn’t have a social media account on this particular platform, which explains why his number of friends and current status are missing (although coded in two different ways). Alice is on the platform, but is a passive user and never sets her status, hence the reason it’s missing for her.
The stringr package is preloaded.
# Replace all empty strings in status with NA
social_df$status[social_df$status == ""] <- NA
# Print social_df to the console
social_df
##    name n_friends         status
## 1 Sarah       244     Going out!
## 2   Tom        NA           <NA>
## 3 David       145 Movie night...
## 4 Alice        43           <NA>
# Use complete.cases() to see which rows have no missing values
complete.cases(social_df)
## [1] TRUE FALSE TRUE FALSE
# Use na.omit() to remove all rows with any missing values
na.omit(social_df)
##    name n_friends         status
## 1 Sarah       244     Going out!
## 3 David       145 Movie night...
Often in data analysis, you’ll want to get a feel for how many complete observations you have. This can help you decide how to handle observations with missing data points.
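Counting complete observations takes only a few lines of base R. The sketch below rebuilds a small data frame like social_df after recoding (the name df and the summary variable names are illustrative) and reports how many rows are complete and how many values are missing per column.

```r
# Hypothetical data frame with a few missing values
df <- data.frame(
  name = c("Sarah", "Tom", "David", "Alice"),
  n_friends = c(244, NA, 145, 43),
  status = c("Going out!", NA, "Movie night...", NA),
  stringsAsFactors = FALSE
)

# Number and share of rows with no missing values
n_complete <- sum(complete.cases(df))
share_complete <- mean(complete.cases(df))

# Number of missing values in each column
na_per_column <- colSums(is.na(df))
```

If most of the missing values sit in one column, dropping that column may cost less information than dropping every incomplete row.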
Outliers and obvious errors
# Simulate some data with three outliers
set.seed(10)
x <- c(rnorm(30, mean = 15, sd = 5), -5, 28, 35)
# View boxplot
boxplot(x, horizontal = TRUE)
Dealing with outliers and obvious errors
When dealing with strange values in your data, you often must decide whether they are just extreme or actually erroneous. Extreme values show up all over the place, but you, the data analyst, must figure out when they are plausible and when they are not.
We have loaded a dataset called students3, which is another slight variation of the original students dataset. Two variables appear to have suspicious values: age and absences. Let’s explore these values further.
# Read students data
students3 <- read_csv("../xDatasets/students_with_dates.csv")
## Warning: Missing column names filled in: 'X1' [1]
# Simulate an age variable (absences is already part of the dataset)
students3$age <- sample(15:40, size = nrow(students3), replace = TRUE)
# Look at a summary() of students3
sum_students3 <- as.data.frame(do.call(cbind, lapply(students3, summary)))
sum_students3[,-1] %>%
kable() %>%
kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive"), full_width = FALSE, position = "left", font_size = 11) %>%
row_spec(0, bold = TRUE, color = "white", background = "#3f7689")
| | school | sex | dob | address | famsize | Pstatus | Medu | Fedu | Mjob | Fjob | reason | guardian | traveltime | studytime | failures | schoolsup | famsup | paid | activities | nursery | higher | internet | romantic | famrel | freetime | goout | Dalc | Walc | health | nurse_visit | absences | Grades | age |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Min. | 395 | 395 | 9802 | 395 | 395 | 395 | 0 | 0 | 395 | 395 | 395 | 395 | 1 | 1 | 0 | 395 | 395 | 395 | 395 | 395 | 395 | 395 | 395 | 1 | 1 | 1 | 1 | 1 | 1 | 1382972394 | 0 | 395 | 15 |
| 1st Qu. | character | character | 10169 | character | character | character | 2 | 2 | character | character | character | character | 1 | 1 | 0 | character | character | character | character | character | character | character | character | 4 | 3 | 2 | 1 | 1 | 3 | 1396839594 | 0 | character | 21 |
| Median | character | character | 10576 | character | character | character | 3 | 2 | character | character | character | character | 1 | 2 | 0 | character | character | character | character | character | character | character | character | 4 | 3 | 3 | 1 | 2 | 4 | 1410793194 | 4 | character | 28 |
| Mean | 395 | 395 | 10529.9291139 | 395 | 395 | 395 | 2.74936708860759 | 2.52151898734177 | 395 | 395 | 395 | 395 | 1.44810126582278 | 2.03544303797468 | 0.334177215189873 | 395 | 395 | 395 | 395 | 395 | 395 | 395 | 395 | 3.94430379746835 | 3.23544303797468 | 3.10886075949367 | 1.48101265822785 | 2.29113924050633 | 3.55443037974684 | 1412919071.46835 | 5.70886075949367 | 395 | 27.6405063291139 |
| 3rd Qu. | character | character | 10893 | character | character | character | 4 | 3 | character | character | character | character | 2 | 2 | 0 | character | character | character | character | character | character | character | character | 5 | 4 | 4 | 2 | 3 | 5 | 1428461994 | 8 | character | 34 |
| Max. | character | character | 11255 | character | character | character | 4 | 4 | character | character | character | character | 4 | 4 | 3 | character | character | character | character | character | character | character | character | 5 | 5 | 5 | 5 | 5 | 5 | 1444921194 | 75 | character | 40 |
# View a histogram of the age variable
hist(students3$age)
# View a histogram of the absences variable
hist(students3$absences)
# View a histogram of absences, but force zeros to be bucketed to the right of zero
hist(students3$absences, right = FALSE)
As you can see, a simple histogram of a variable’s values across all observations can be key to identifying potential outliers as early as possible.
Another look at strange values
Another useful way of looking at strange values is with boxplots. Simply put, boxplots draw a box around the middle 50% of values for a given variable, with a bolded horizontal line drawn at the median. Values that fall far from the bulk of the data points (i.e. outliers) are denoted by open circles. (If you’re curious about the exact formula for determining what is “far”, check out ?boxplot.stats.)
# View a boxplot of age
boxplot(students3$age)
# View a boxplot of absences
boxplot(students3$absences)
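To see which values a boxplot treats as outliers, you can ask boxplot.stats() directly. The sketch below uses a made-up vector x rather than the students data. By default, a point is flagged when it lies more than 1.5 times the height of the box beyond either edge of the box; the second half reproduces that rule by hand using fivenum().

```r
# Hypothetical absences with one extreme value
x <- c(10, 12, 13, 14, 15, 16, 17, 18, 20, 75)

# Values that boxplot() would draw as open circles
flagged <- boxplot.stats(x)$out

# The same rule by hand: more than 1.5 times the box height beyond the box edges
hinges <- fivenum(x)[c(2, 4)]   # lower and upper edges of the box
box_height <- diff(hinges)
lower_fence <- hinges[1] - 1.5 * box_height
upper_fence <- hinges[2] + 1.5 * box_height
by_hand <- x[x < lower_fence | x > upper_fence]
```

A flagged value is only a candidate for review; whether it is an error or a plausible extreme still has to be judged from context.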